1  Introduction

This paper presents the first participation of the Luxembourgish
Public Research Center Henri Tudor in the TREC 2014 Clinical Decision
Support (CDS) Track. At the Resource Centre for Healthcare
Technologies (SANTEC) department, we focus our research activities on
healthcare technologies. Our mission consists primarily in improving
healthcare by developing methods, tools, services and solutions that
can be applied by healthcare professionals, patients and citizens on
a daily basis.

In this research work, we present an approach to combining search
results using data fusion techniques. The focus of the 2014 Clinical
Decision Support Track was the retrieval of biomedical articles
relevant for answering generic clinical questions about medical
records. Each question consists of a case report and one of three
generic clinical question types, such as “What is the patient’s
diagnosis?”. Retrieved articles are judged relevant if they provide
information of the specified type that is relevant to the given case.

The remainder of this report is organized as follows. Section 2 gives
a brief description of the CDS task. Section 3 describes our
methodology for combining search results. Our submitted runs and the
official results of the TREC CDS Track are described in Section 4.
Finally, Section 5 draws conclusions and outlines directions for
future work.

2  Overview  of  the  CDS  Task 

The TREC Clinical Decision Support (CDS) Track investigates the
performance of systems that search a static set of documents using
short case reports, such as those published in biomedical articles,
as idealized representations of actual medical records. The goal of
the task is to rank the documents in the collection in decreasing
order of probability of relevance.

2.1  Document  collection 

The  target  document  collection  for  the  track 
is  the  Open  Access  Subset  of  PubMed  Central 
(PMC).  PMC  is  an  online  digital  database  of 
freely  available  full-text  biomedical  literature. 
Because  documents  are  constantly  being  added 
to PMC, to ensure the consistency of the collection, we obtained a
snapshot of the open access subset on January 21, 2014, which
contained a total of 733,138 articles. The full text of each article
in the open access subset is represented as an NXML file (XML encoded
using the NLM Journal Archiving and Interchange Tag Library), and
images and other supplemental materials are also available.

Each  article  in  the  collection  is  identified  by 
a  unique  number  (PMCID),  which  is  specified 
by  the  <article-id>  element  within  each  article’s 
NXML  file.

2.2  Topics 

The topics for the track are medical case narratives created by
expert topic developers that serve as idealized representations of
actual medical records. The case narratives describe information such
as a patient’s medical history, the patient’s current symptoms, tests
performed by a physician to diagnose the patient’s condition, the
patient’s eventual diagnosis, and finally, the steps taken by a
physician to treat the patient.

There  are  many  clinically  relevant  questions 
that  can  be  asked  for  a  given  case  narrative.  In 
order  to  simulate  the  actual  information  needs  of 
physicians,  the  topics  are  annotated  according  to 
the  three  most  common  generic  clinical  question 
types:  diagnosis,  test  and  treatment.

2.3  Evaluation  protocol 

The track received a total of 102 runs from 26 different groups. All
the runs contributed to the judgment sets, which were constructed to
be compatible with the computation of the inferred measures. In
particular, the judgment sets were created using two strata: all
documents retrieved in ranks 1-20 by any run, in union with a 20%
sample of the documents not in the first stratum that were retrieved
in ranks 21-100 by some run.
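The two-stratum construction of the judgment sets can be illustrated with a minimal Python sketch. It is not the procedure actually used by the track organizers: the function name `build_judgment_pool`, its parameters, the per-topic input format and the fixed random seed are all assumptions made for illustration.

```python
import random


def build_judgment_pool(runs, top_depth=20, max_depth=100, sample_rate=0.2, seed=0):
    """Build a two-stratum judgment pool for one topic (illustrative sketch).

    runs: list of rankings, each a list of docnos ordered by decreasing score.
    Returns (top_stratum, sampled_stratum) as two disjoint sets of docnos.
    """
    top_stratum = set()
    deeper = set()
    for ranking in runs:
        top_stratum.update(ranking[:top_depth])       # ranks 1-20 of every run
        deeper.update(ranking[top_depth:max_depth])   # ranks 21-100 of every run
    deeper -= top_stratum                             # keep only documents not in the first stratum
    rng = random.Random(seed)                         # assumption: fixed seed for reproducibility
    sample_size = round(sample_rate * len(deeper))
    sampled_stratum = set(rng.sample(sorted(deeper), sample_size))
    return top_stratum, sampled_stratum


# Toy example with two hypothetical runs of 100 documents each.
run_a = [f"PMC{i}" for i in range(100)]
run_b = [f"PMC{i}" for i in range(50, 150)]
top, sampled = build_judgment_pool([run_a, run_b])
```

In this toy example the first stratum holds the 40 documents ranked 1-20 by either run, and the second stratum is a 20% sample of the 110 remaining documents ranked 21-100.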

Documents in the judgment set were judged on a three-point scale:
0 (not relevant), 1 (possibly relevant) and 2 (definitely relevant).
The evaluation measures were computed by conflating the possibly
relevant and definitely relevant sets into a single relevant set. The
exception is the inferred NDCG measure, which makes use of the
different relevance grades. The measures reported by sample_eval are
inferred average precision (infAP), inferred NDCG (infNDCG), inferred
precision at 10, 50 and 100 documents retrieved (iP10, iP50, iP100),
inferred number of relevant retrieved (inum_rel_ret), inferred number
of relevant (inum_rel) and number retrieved (num_ret).
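As an illustration of how the graded judgments are conflated, the following Python sketch maps grades 1 and 2 to a single relevant class and computes plain precision at k. It does not reproduce the inferred measures of sample_eval, which additionally account for the sampling of the judgment set. The function names and the choice to treat unjudged documents as not relevant are assumptions.

```python
def to_binary(grade):
    """Conflate 'possibly relevant' (1) and 'definitely relevant' (2) into relevant (1)."""
    return 1 if grade >= 1 else 0


def precision_at_k(ranking, judgments, k=10):
    """Plain precision at rank k; unjudged documents count as not relevant (assumption)."""
    hits = sum(to_binary(judgments.get(docno, 0)) for docno in ranking[:k])
    return hits / k


# Hypothetical judgments for one topic: docno -> grade on the 0/1/2 scale.
judgments = {"PMC1": 2, "PMC2": 1, "PMC3": 0}
ranking = [f"PMC{i}" for i in range(1, 11)]
p10 = precision_at_k(ranking, judgments)
```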

3  Methodology 

In  our  previous  work  |dj[%j>[9 1 ,  we  have  examined 
several  data  fusion  techniques,  term  weighting 
models  and  query  expansion  models.  In  this 
work, we propose another strategy for merging search results into a
list of results. We first present the document indexing and retrieval
architecture (Section 3.1) and then we present our approach for
combining results (Section 3.2).

3.1  Document  indexing  and  retrieval 

Our  indexing  and  retrieval  framework  is  based 
on  an  open  source  search  engine,  which  has  been 
widely  used  for  research  in  IR.  More  specifically, 
we  used  the  Terrier  IR  platform  for  indexing  and 
retrieving  documents  in  the  collection  [10].

Indexing organizes, structures and stores statistical and/or
linguistic information about the terms and documents in the
collection to allow rapid and efficient search. During the indexing
stage, stop-words are removed from documents before stemming with the
Porter algorithm.

Document retrieval matches the user query against the document
representations in order to retrieve a list of results that may
satisfy the user's information need. A document D containing terms
used to formulate a query Q is weighted by summing the score of each
query term appearing in D:


\[ \mathrm{RSV}(D, Q) = \sum_{t \in Q} \mathrm{score}(t \in D) \qquad (1) \]

where score(t ∈ D) is the query term weight calculated using a
particular term weighting model. To evaluate the performance of
current state-of-the-art weighting models, we chose three term
weighting models for our experiments, namely BM25 [12], In_expB2 [2]
and LGD [4]. We then applied several state-of-the-art
pseudo-relevance feedback techniques based on statistical measures,
such as the Bose-Einstein (Bo) statistics [1], in order to select the
most related terms for enriching the original query.
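The retrieval status value of Formula 1 can be sketched in a few lines of Python. The sketch is independent of Terrier: the function `rsv` and the toy weighting function `tf_score` are hypothetical names, and raw term frequency merely stands in for a real weighting model such as BM25, In_expB2 or LGD.

```python
from collections import Counter


def rsv(document_terms, query_terms, score):
    """Sum the weights of the query terms that occur in the document (Formula 1)."""
    term_counts = Counter(document_terms)
    return sum(score(term, term_counts) for term in query_terms if term in term_counts)


def tf_score(term, term_counts):
    """Toy weighting model (assumption): the raw frequency of the term in the document."""
    return float(term_counts[term])


doc = ["fever", "cough", "fever", "infiltrates"]
query = ["fever", "cough", "trip"]
value = rsv(doc, query, tf_score)  # "trip" does not occur in the document and contributes nothing
```

Passing the weighting model as a function keeps the summation of Formula 1 separate from the choice of model, so that the same loop could be reused for each of the three models.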


3.2  Document  fusion 

Let L = {L_1, L_2, ..., L_k} be the set of result lists returned by
k IR methods, where L_i is the result list obtained by IR method i.
Formally:

\[ L_i = \{D_{i1}, D_{i2}, \ldots, D_{iz}\} \qquad (2) \]

where z is the size of the list L_i and D_{ij} is document j
belonging to the list L_i.

Each IR method can be configured by a set of parameters (e.g., the
term weighting model, the term's properties (stop-word, low IDF, term
length), etc.).

When the result lists for each query are combined, we build a set of
runs of documents. Each run contains documents with the same ID
(docno) but with different scores returned by different IR methods.

Inspired by our previous work [5, 9], we aim to define novel
combination techniques based on several means:

\[
\begin{aligned}
\mathrm{CombH} &= \frac{n}{\frac{1}{a_1} + \frac{1}{a_2} + \cdots + \frac{1}{a_n}} && \text{(harmonic mean)} \\
\mathrm{CombG} &= \sqrt[n]{a_1 \cdot a_2 \cdots a_n} && \text{(geometric mean)} \\
\mathrm{CombA} &= \frac{a_1 + a_2 + \cdots + a_n}{n} && \text{(arithmetic mean)} \\
\mathrm{CombQ} &= \sqrt{\frac{a_1^2 + a_2^2 + \cdots + a_n^2}{n}} && \text{(quadratic mean)}
\end{aligned}
\qquad (3)
\]

where n is the number of documents in each run and a_i is the score
of document D_i.

4  TREC  CDS  Submissions

4.1  Run  description

We submitted five official runs to the TREC CDS track. Our submitted
runs are divided into two groups: the first includes four automatic
runs and the second includes one manual run. For each group of runs,
we aim to evaluate the performance of our data fusion techniques in
comparison with state-of-the-art retrieval models.

The five submitted runs are described as follows:

•  tudorComb1: query terms are taken only from the description field
(long query). We ignore low-IDF terms and run the retrieval using
three IR models, namely BM25, LGD and In_expB2, with query expansion
(top 30 terms from the top 20 returned documents) using the Bo1
model. Finally, the results are combined using the CombA fusion
technique (see Formula 3).

•  tudorComb2: this run differs from the previous run in that query
terms are extracted only from the summary field.

•  tudorComb3: this run is similar to the first run except that we
also consider low-IDF terms for retrieval.

•  tudorComb4: this run is similar to the third run except that query
terms are taken from the summary field.

•  tudorCombm: the last run is similar to the first run but is
classified as manual because we expand the original query with the
long forms of abbreviations and give more weight to terms denoting
medical concepts.
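The fusion step shared by these runs can be sketched as follows. This is a minimal illustration of the four means of Formula 3, not the code used for the submitted runs. The function names and toy scores are hypothetical, and the sketch assumes that all scores are strictly positive, that scores from different IR models are comparable, and that a document missing from a list is simply averaged over the lists that returned it.

```python
import math
from collections import defaultdict


def comb_h(scores):
    """Harmonic mean of the scores."""
    return len(scores) / sum(1.0 / a for a in scores)


def comb_g(scores):
    """Geometric mean of the scores, computed in log space."""
    return math.exp(sum(math.log(a) for a in scores) / len(scores))


def comb_a(scores):
    """Arithmetic mean of the scores."""
    return sum(scores) / len(scores)


def comb_q(scores):
    """Quadratic mean (root mean square) of the scores."""
    return math.sqrt(sum(a * a for a in scores) / len(scores))


def fuse(result_lists, combine):
    """Group scores by docno across result lists and rank by the combined score."""
    scores_by_doc = defaultdict(list)
    for results in result_lists:
        for docno, score in results.items():
            scores_by_doc[docno].append(score)
    fused = {docno: combine(scores) for docno, scores in scores_by_doc.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores returned by two IR models for the same query.
bm25 = {"PMC1": 4.0, "PMC2": 1.0}
lgd = {"PMC1": 2.0, "PMC2": 4.0, "PMC3": 2.0}
ranking = fuse([bm25, lgd], comb_a)
```

For strictly positive scores the four means are ordered as CombH ≤ CombG ≤ CombA ≤ CombQ, so the choice of mean controls how strongly a single low score penalizes a document.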

We use the default term processing pipeline in Terrier: stop-words
are removed from documents and queries before stemming with the
Porter algorithm. For the manual run, we further removed query terms
that are not present in the stop-word list but that we believe are
not informative. For example, in the query “Patients with hearing
loss”, the term “with” is recognized as a stop-word and is therefore
automatically removed from documents and queries. We compare the
combined results with the baseline results obtained by each of the
three IR models, namely BM25, LGD and In_expB2, with query expansion
(top 30 terms from the top 20 returned documents) using the Bo1
model.
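The stop-word example above can be reproduced with a small Python sketch. It does not use Terrier: the stop-word list is a tiny hypothetical one, tokenization is a plain whitespace split, and Porter stemming is left out.

```python
# Hypothetical, deliberately tiny stop-word list (not Terrier's default list).
STOP_WORDS = frozenset({"a", "an", "and", "in", "of", "the", "with"})


def remove_stop_words(text):
    """Lowercase the text, split on whitespace and drop stop-words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


tokens = remove_stop_words("Patients with hearing loss")
```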

In what follows, we present the results of our official runs
submitted to TREC CDS 2014. Afterwards, we present the results
obtained unofficially on the small set of documents that have been
judged either relevant or irrelevant.

4.2  Official  results 

Table 1 shows the official results of our runs submitted to TREC CDS
2014. According to the results, there is no significant difference
between long (description) and short (summary) queries. This is
probably due to the poor performance of the IR models on the
underlying collection. Indeed, the baseline IR performance is very
low: bpref ≈ 0.10, MAP ≈ 0.05, infAP ≈ 0.009, infNDCG ≈ 0.11,
P10 ≈ 0.25 and iP10 ≈ 0.18.

We also notice that the combined results outperform those obtained
by the baselines, i.e., without combination. For example, the
infNDCG of the tudorComb2 run is 0.1640, which is considerably better
than that of the best baseline, LGD+Bo1 (infNDCG = 0.1175), an
improvement of +39.57%. There is also an improvement of +38.19% in
terms of iP10 when comparing run tudorComb1 (iP10 = 0.2533) with the
best baseline, BM25+Bo1 (iP10 = 0.1833). These results provide
evidence that combining results improves IR performance.
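The improvement rates quoted above are relative differences with respect to the baseline. A short Python sketch, in which the function name `improvement_rate` is our own, recomputes them from the values of Table 1.

```python
def improvement_rate(run_value, baseline_value):
    """Relative improvement of a run over a baseline, in percent."""
    return 100.0 * (run_value - baseline_value) / baseline_value


# infNDCG of tudorComb2 vs. the best baseline, and iP10 of tudorComb1 vs. the best baseline.
inf_ndcg_gain = round(improvement_rate(0.1640, 0.1175), 2)
ip10_gain = round(improvement_rate(0.2533, 0.1833), 2)
```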

4.3  Performance  analysis 

We study the performance of different data fusion techniques for
combining search results. Table 2 reports the IR results obtained on
the pool of documents that have been judged either relevant or not
relevant. Here, we only use the documents that have been judged,
i.e., 29,969 documents, as if the collection consisted solely of
judged documents. The results confirm that the arithmetic mean yields
the best performance among the four means (harmonic, geometric,
arithmetic and quadratic), with P10 = 0.2967 and iP10 = 0.2000.
However, even though the document space was dramatically reduced,
i.e., unjudged documents were removed from the evaluation, the
overall IR effectiveness in terms of MAP, infAP and infNDCG remains
low (MAP_best = 0.1803, infAP_best = 0.0168, infNDCG_best = 0.1756).
Also, the overall effectiveness of our system is close to the
performance of the BM25 IR model on this smaller collection (only
documents that have been judged).

Run             bpref    MAP      P10      infAP    infNDCG  iP10

Submitted runs
tudorComb1      0.1045   0.0503   0.2533   0.0324   0.1508   0.2533
tudorComb2      0.1130   0.0513   0.2500   0.0335   0.1640   0.2500
tudorComb3      0.0320   0.0120   0.1167   0.0100   0.0655   0.1167
tudorComb4      0.0394   0.0168   0.1433   0.0151   0.0819   0.1433
tudorCombm      0.1070   0.0531   0.2467   0.0349   0.1618   0.2467

Baseline
BM25+Bo1        0.1049   0.0502   0.2500   0.0092   0.1170   0.1833
LGD+Bo1         0.1033   0.0483   0.2433   0.0089   0.1175   0.1833
In_expB2+Bo1    0.1163   0.0486   0.2467   0.0077   0.1017   0.1700

Table 1: IR effectiveness obtained by each run on the TREC CDS 2014
collection. Bold numbers correspond to the best performance of the
submitted runs while underlined numbers correspond to the best
performance of the baselines.

Figure 1 depicts the percentage of relevant vs. irrelevant documents
among the judged documents. We observe that for each query, there is
a high number of irrelevant documents in comparison with relevant
documents.

A first analysis of false positives showed that some irrelevant
documents were ranked highly by our system due to irrelevant
keywords. For example, the word “trip” in one of the queries (see
footnote 1) led to several irrelevant documents. In future work, it
would be interesting to investigate whether the semantic analysis of
topics could improve the performance of our system. We plan to use
NLP techniques to analyze queries semantically (e.g., [3]) in order
to give different weights to keywords according to their importance
w.r.t. the focus of the query.

1  Diagnosis. 8-year-old boy with 2 days of loose stools, fever and
cough after returning from a trip to Colorado. Chest x-ray shows
bilateral lung infiltrates.

Comb.                     bpref    MAP      P10      infAP    infNDCG  iP10

CombAVGA (tudorComb1)     0.1902   0.1803   0.2967   0.0168   0.1756   0.2000
CombAVGG                  0.0848   0.0456   0.0400   0.0043   0.0821   0.0833
CombAVGH                  0.1910   0.1443   0.2400   0.0104   0.1211   0.1267
CombAVGQ                  0.0848   0.0456   0.0400   0.0043   0.0821   0.0833

Baseline
BM25+Bo1                  0.1965   0.1846   0.3233   0.0179   0.1869   0.2033
LGD+Bo1                   0.1852   0.1728   0.3133   0.0174   0.1832   0.2067
In_expB2+Bo1              0.1955   0.1828   0.3000   0.0161   0.1737   0.1933

Table 2: IR effectiveness obtained by each of the combination
techniques on the judged documents (either relevant or irrelevant)
extracted from the TREC CDS 2014 collection.

Figure 1: Some statistics about the documents submitted by all
participants. Those documents are then merged into a pool and judged
by TREC as relevant or irrelevant.

5  Conclusion

We presented our participation in the Clinical Decision Support
track of TREC 2014, which is a biomedical ad hoc retrieval task. The
underlying IR platform of our experiments is the Terrier search
system. Our participation focused on the evaluation of several IR
models for term weighting as well as state-of-the-art query expansion
models. We have also evaluated several combination techniques. The
combination of search results showed an improvement of the IR
performance for large document collections.

In future work, we aim to investigate which linguistic features
extracted from relevant documents are adequate for better promoting
relevant documents. For example, we can study the semantic similarity
between relevant documents and derive an IR model that ranks
documents based on their pairwise semantic similarity.